EDA on Red Wine Quality

By: Sunda Gerard

knitr::opts_chunk$set(echo = FALSE)

# Importing the libraries


library(ggplot2)
library(ggridges)
library(reshape2)
library(dplyr)
library(tidyr)
library(gridExtra)
library(GGally)
library(memisc)
library(Hmisc)
library(pander)
library(corrplot)

#Importing the data into RStudio

df <- read.csv('wineQualityReds.csv')

Abstract

This project explores a dataset of red wines and the chemical properties that may affect their quality. The data is imported into RStudio for exploratory data analysis.

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

A few personal observations so far of the data:

There are 13 variables and 1,599 observations.

X is an index variable that identifies the individual red wines. Since it does not contribute to the analysis, we will remove it.

Of the 12 remaining variables, quality is the outcome of interest: the rating of how the wine tastes. The other 11 variables are chemical properties that may affect that quality.

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

Now that the X variable is removed, we can focus on analyzing the remaining variables in our dataset.
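
Dropping the index column described above takes one line in base R. The sketch below is only an illustration: `wine` is a stand-in name introduced here, built from the first three rows printed by str() earlier and limited to a few columns, so that the example does not depend on the CSV file.

```r
# Stand-in for df: first three rows from the str() output above,
# restricted to a few columns for brevity.
wine <- data.frame(
  X       = 1:3,
  pH      = c(3.51, 3.20, 3.26),
  alcohol = c(9.4, 9.8, 9.8),
  quality = c(5L, 5L, 5L)
)

# Assigning NULL to a column removes it from the data frame.
wine$X <- NULL

names(wine)
```

The same assignment applied to the full data frame (`df$X <- NULL`) would leave the 12 variables shown in the summary above.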

Univariate Plots Section

First, I want to see what each variable looks like graphed. I will create individual histograms of each variable.
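
One way to draw a histogram per variable is sketched below. It uses base graphics to stay dependency-free, although the report itself loads ggplot2, so this is not necessarily how the original plots were made. `plot_histograms` and `wine` are hypothetical names, and the data are just the first ten rows of three columns copied from the str() output.

```r
# Stand-in for df: first ten rows of three columns from the str() output.
wine <- data.frame(
  fixed.acidity  = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1),
  alcohol        = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5)
)

# Draw one histogram per column on a grid with n_col panels per row,
# and return the bin counts of each histogram for inspection.
plot_histograms <- function(data, n_col = 3) {
  n_row <- ceiling(ncol(data) / n_col)
  old_par <- par(mfrow = c(n_row, n_col))
  on.exit(par(old_par))  # restore the previous panel layout
  lapply(names(data), function(v) {
    h <- hist(data[[v]], main = v, xlab = v, col = "grey80")
    h$counts
  })
}

counts <- plot_histograms(wine)
```

For the full 12-variable data frame, a wider grid such as `n_col = 4` would probably be easier to read.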

Looking at the histograms, we can see the following types of distributions:

Normal: volatile.acidity, density, and pH are all roughly normally distributed.

Positively skewed: fixed.acidity, citric.acid, free.sulfur.dioxide, total.sulfur.dioxide, sulphates, and alcohol all have a long right tail.

Outliers: Many of the variables have extreme outliers, including sulphates, total.sulfur.dioxide, chlorides, and residual.sugar.

Quality: Most of the wines are in the 5 or 6 quality range. This seems to suggest that there aren't many terrible tasting red wines on this list, and only a few great tasting ones.

To bring the positively skewed distributions closer to a normal distribution, we can apply the sqrt function to those variables.
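
A quick way to check whether the square-root transformation helps is to compare skewness before and after. The sketch below is a minimal illustration on the first ten total.sulfur.dioxide values from the str() output; `skewness` is a hypothetical helper defined here (base R has no built-in skewness function), and results on the full dataset may differ.

```r
# First ten total.sulfur.dioxide values, copied from the str() output above.
tsd <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# Moment-based sample skewness: the third central moment divided by the
# second central moment raised to the power 1.5.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skew_raw  <- skewness(tsd)        # positive means a longer right tail
skew_sqrt <- skewness(sqrt(tsd))  # same measure after the transformation

c(raw = skew_raw, sqrt = skew_sqrt)
```

A value closer to zero after the transformation indicates a more symmetric distribution, though symmetry alone does not guarantee normality.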

Univariate Analysis

What is the structure of your dataset?

After removing X, there are 12 variables with 1,599 observations.

What is/are the main feature(s) of interest in your dataset?

The main interest in the dataset is what variables positively and negatively affect the quality of red wine.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I believe that factors including, but not limited to, alcohol, density, residual sugar, and pH will make a difference in the quality of the wine.

Did you create any new variables from existing variables in the dataset?

We transformed a few variables with the sqrt function to bring their previously skewed distributions closer to normal.

Of the features you investigated, were there any unusual distributions?

The fixed.acidity, citric.acid, free.sulfur.dioxide, total.sulfur.dioxide, sulphates, and alcohol variables were positively skewed and were later transformed toward a more normal distribution.

Sulphates, total.sulfur.dioxide, chlorides, and residual.sugar all had extreme outliers.
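
The report identifies outliers visually. As a rough numeric counterpart, the sketch below applies the usual boxplot rule (points more than 1.5 times the interquartile range beyond the hinges) to the first ten residual.sugar values from the str() output. This rule is an assumption made for illustration, not the criterion used in the report.

```r
# First ten residual.sugar values, copied from the str() output above.
rs <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

# boxplot.stats() returns the points lying beyond 1.5 * IQR from the hinges.
flagged <- boxplot.stats(rs)$out

flagged
```

On these ten values the rule flags only the 6.1 reading. Applied to the full column it would likely flag many more, consistent with the long right tail in the summary (median 2.2, maximum 15.5).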

Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I used the sqrt function to transform the fixed.acidity, citric.acid, free.sulfur.dioxide, total.sulfur.dioxide, sulphates, and alcohol variables toward a normal distribution.

I removed the X variable, which was used for indexing in the dataset.

Bivariate Plots Section

The first relationship I am interested in is between free.sulfur.dioxide and total.sulfur.dioxide. Not knowing much about either or how they translate into making wine, I am interested in seeing if the variables are correlated in any way.

The plot shows a positive relationship between free.sulfur.dioxide and total.sulfur.dioxide. As one increases, so does the other in most cases.
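
The visual impression can be backed up with a correlation coefficient. The following is a minimal sketch using only the first ten rows from the str() output, so the value it prints is illustrative and will not match the coefficient for all 1,599 wines.

```r
# First ten values of each variable, copied from the str() output above.
free_so2  <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)
total_so2 <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# Pearson correlation (the default method of cor()).
r <- cor(free_so2, total_so2)

r
```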

The next relationship I wanted to examine was between citric.acid and quality. It would seem that the freshness and flavor that citric.acid brings would make the wine taste better. There is no strong correlation apparent between citric.acid and the quality of the wine. However, most wines have a citric.acid value below 0.5 and fall between 4.5 and 6.5 in quality. A small relationship is present, but not enough to say definitively that citric.acid strongly affects the quality of wine.

I am now going to see if residual.sugar has any strong effect on the quality of wine. Taste preferences can vary for sweet or dry wine, so I believe that this will have no strong correlation to quality.

Surprise! Most of the wines have a residual.sugar value below 4, yet they have varying levels of quality. This may suggest that dry wines are typically rated higher than sweet ones, but some of this may be because most of the wines are not sweet to begin with. As with citric.acid, the quality of the wines is mostly between 4.5 and 6.5, with a chunk also around 7. Still, this is a relationship that needs more examination.

Next, I want to examine chlorides, the amount of salt in the wine, and see if there is any relationship to the quality of the wine.

It looks like there is some correlation between the amount of chlorides in wine and the quality of the wine. The less chloride in the wine, the less salty it is, which seems to mean a better taste. Most of the wines do not contain many chlorides though, with a majority below 0.2. This breakdown looks fairly similar to the residual.sugar plot.

Since they are similar, I want to compare the two variables on a plot. There appears to be a lot of correlation between the two variables, apart from a few outliers.

Next, I want to see if there is any relation between alcohol content and the quality of wine. There is a strong correlation here: as the alcohol content increases, so does the quality in most cases. Wines with an alcohol content below 10 mostly have a quality between 4.5 and 6.5, with a few outliers. For alcohol content between 10 and 12, the quality again ranges between 4.5 and 6.5, but many wines also fall in the 6.5 to 7.5 range. Most wines with an alcohol content greater than 12 have a quality of 6.5 or greater.
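
The alcohol bands described above (below 10, between 10 and 12, above 12) can be tabulated against quality. The sketch below is illustrative only: it uses the first ten rows from the str() output, and the handling of wines at exactly 10 or 12 (placed in the higher band here) is an assumption, since the text does not say.

```r
# First ten alcohol and quality values, copied from the str() output above.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# right = FALSE makes each interval closed on the left: [10, 12), [12, Inf).
alcohol_band <- cut(
  alcohol,
  breaks = c(-Inf, 10, 12, Inf),
  right  = FALSE,
  labels = c("< 10", "10 to 12", ">= 12")
)

# Cross-tabulate alcohol band against quality score.
band_table <- table(alcohol_band, quality)

band_table
```

With only ten rows the highest band is empty. On the full data frame the same table would show how the quality scores shift across the three bands.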

Next, I want to take a look at volatile.acidity to see if there is a relationship between this variable and the quality of wine. High amounts of volatile.acidity usually give wine a vinegar taste that is strong and pungent.

It becomes apparent that the lower volatile.acidity translates into higher quality wines, whereas higher volatile.acidity results in mostly lower quality wines. This makes sense because people want to drink wine, not drink vinegar.

Next, I want to look at fixed.acidity to see if there is correlation to wine quality.

There are low amounts of fixed.acidity across the board, with most wines below 12 and a majority below 10. Concentrations of wine qualities of 5 and 6 are prevalent. We do not see a strong correlation here, but this variable should be investigated further in the multivariate analysis section.

I want to see if density has any effect on wine quality. Although the density of most of the wines is between 0.995 and 1, there seems to be no real correlation between the density and quality of the wine.

Next, I want to investigate if pH has any effect on wine quality. This plot looks very similar to the density plot before this one. Most of the wines are between 3 and 3.5 pH, while the quality is all over the place. I don’t see any correlation from this plot, but we will investigate further shortly.

Lastly, we will look at sulphates and whether there is a correlation to wine quality.

Almost all of the wines have a sulphates value below 1, and the wine quality is concentrated in three spots, at 5, 6, and 7. As sulphates increase, so does the quality in general. There is not enough to say there is a strong correlation, but some relationship is evident.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

I observed that alcohol and volatile.acidity had the most telling relationships with the quality of wine. There were also noticeable effects from citric.acid and sulphates. More examination will be needed for both residual.sugar and chlorides, although both were observed to be small factors in determining the quality of wine.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I was surprised that residual.sugar and chlorides were closely correlated.

What was the strongest relationship you found?

The strongest relationship I found was between alcohol content and the quality of red wine.

Multivariate Plots Section

I want to examine further the four variables that we concluded may have a relationship to the quality of wine: alcohol, fixed.acidity, volatile.acidity, and pH.

I wanted to re-examine the relationship between fixed.acidity and the quality of red wine, with an emphasis on alcohol content. I focused on alcohol content since we are confident that it is a factor in the quality of wine. The plot shows the expected pattern of higher alcohol content going with higher wine quality. Quality is spread widely across the plot while sulphates are almost all between 0.5 and 1.0, so we can say that there is a correlation between sulphates and the quality of red wine, just not one as strong as the relationship between alcohol and quality.

Next, I want to see how volatile.acidity is related to alcohol and the quality of wine.

We can also see in this plot that high quality wines tend to have higher alcohol content. Most higher quality wines have low volatile.acidity levels, which points to a meaningful negative correlation. We will see how strong the correlation is shortly.

I want to see if citric.acid and the alcohol content have strong impacts on the quality of the wine.

We can see from the plot that there is a relationship between the alcohol content and the citric.acid of wine. Most good quality wines have citric.acid between 0.25 and 0.75, so there is a sweet spot for making good quality wine, but higher alcohol content seems to continue to make good wines in general.

I wanted to see if there was a negative relationship between citric.acid and volatile.acidity, similar to the relationship between quality and volatile.acidity.

From this plot, we can see that most wines have between 0 and 0.75 citric.acid, but most higher quality wines are between 0.25 and 0.75. There is a negative correlation between volatile.acidity and citric.acid, similar to that of quality and volatile.acidity. At this point we can say that both citric.acid and volatile.acidity have an impact on the quality of wine.

The last multivariate plot I want to make examines the relationship between pH and the alcohol content of our wines.

The plot shows a positive relationship between alcohol content and the pH of wines in our dataset. As the alcohol content increases, the pH of the wine generally also increases. This is especially true of the wines that scored a 6, 7, or 8 in quality. Since these two are correlated, pH may have some impact on the quality of wine.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

There is a positive correlation between the pH and the alcohol content of wines. Volatile.acidity and citric.acid have a negative correlation. Citric.acid and alcohol content have a positive correlation. Sulphates and quality of wine have a positive correlation. Finally, there is a positive correlation between alcohol content and quality of wine.

Were there any interesting or surprising interactions between features?

I am surprised that some of the variables appear to be related strongly and that some variables are not clearly correlated with quality or with each other.

Final Plots and Summary

Plot One

The first plot is a numeric correlation matrix that shows the strength and direction of the correlation between each pair of variables.

Description One

As we can see, there are a lot of interesting findings that clarify the dataset. This plot also gives us concrete correlation coefficients and shows whether each is significant at the .05 level. The correlation between alcohol content and quality is .48, the strongest relationship with quality among the variables. Other positive correlations with quality, all weaker, include sulphates at .25, citric.acid at .23, and fixed.acidity at .12. Quality is negatively correlated with volatile.acidity at -.39, followed further behind by total.sulfur.dioxide at -.19, density at -.17, and chlorides at -.13.
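
A correlation matrix of this kind, together with a significance test for one pair, can be sketched in base R as follows. The data are only the first ten rows of four columns from the str() output, so the coefficients will not reproduce the values quoted above; `wine`, `cor_mat`, and `alcohol_test` are names introduced here for illustration.

```r
# Stand-in for df: first ten rows of four columns from the str() output.
wine <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pairwise Pearson correlations between all columns.
cor_mat <- cor(wine)
round(cor_mat, 2)

# Correlations of each variable with quality, strongest positive first.
sort(cor_mat[, "quality"], decreasing = TRUE)

# Test whether the alcohol-quality correlation differs from zero.
alcohol_test <- cor.test(wine$alcohol, wine$quality)
alcohol_test$p.value < 0.05  # TRUE if significant at the .05 level
```

Since the report already loads corrplot, a call such as `corrplot(cor_mat, method = "number")` would presumably draw a numeric matrix like the one described, though the exact call used for Plot One is not shown in the report.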

Plot Two

In the next plot I want to examine how volatile.acidity, which is negatively correlated with wine quality, and alcohol, which is positively correlated with it, relate to quality together. We will display this on a heat map plot.

Description Two

The heat map plot shows how strongly the two variables relate to wine quality. We see a lot more red dots at the top left of the plot, indicating that higher wine quality is related to higher alcohol levels and lower volatile.acidity. Green dots, indicating lower wine quality, are more common at the bottom of the plot where alcohol content is lower, and also where volatile.acidity is higher.

Plot Three

The final plot will examine the mean alcohol content at each wine quality level. It should be interesting to see whether the mean alcohol content rises steadily as quality increases. The plot will also show how many wines there are at each quality level.

Description Three

There are some interesting results here in the plot. I was not expecting a dip in the alcohol content across the quality levels.

Wines with a quality of 3 start with a mean alcohol content of 10, and there are not many of them. We then see a slight uptick in alcohol content for wines with a quality of 4, and there are more of those than wines with a quality of 3. The interesting aspect is that wines with a quality rating of 5 have a lower average alcohol content, which dips just below 10. There are also many more wines at this quality level than at either 3 or 4.

Alcohol content increases again at quality level 6, where the average is somewhere between 10 and 11. There is also a much wider spread of alcohol content at this quality level, with a majority between 9 and 13. The number of wines starts to decrease at quality level 7, while the average alcohol content continues to increase to between 11 and 12. Finally, there are a little over a dozen wines at a quality level of 8, and their average alcohol content increases again to about 12.
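
The numbers behind a plot like this (mean alcohol content and wine count per quality level) can be computed directly. The sketch below uses only the first ten rows from the str() output, so it covers quality levels 5 to 7 and will not reproduce the means described above; `by_quality` is a name introduced here for illustration.

```r
# First ten alcohol and quality values, copied from the str() output above.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Mean alcohol content within each quality level.
mean_alcohol <- tapply(alcohol, quality, mean)

# Number of wines at each quality level.
n_wines <- table(quality)

# Combine both into one small summary table.
by_quality <- data.frame(
  quality      = names(mean_alcohol),
  n            = as.vector(n_wines),
  mean_alcohol = as.vector(mean_alcohol)
)

by_quality
```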

Reflection

As we found out, there isn't a distinct formula for making great quality wine. Taste is different for every individual. Sweetness makes a better wine for some, while others prefer dry wines. I honestly expected sweeter wines to be of higher quality. Through investigation, our dataset indicated a strong positive correlation between alcohol content and the quality of wine. Typically, the higher the alcohol content, the higher the quality of the wine. Some of the data suggests, however, that this is not always true and that other factors can cause the wine to be of varying quality. If higher alcohol content were equivalent to higher quality wine, then winemakers would simply spike the wine to increase the alcohol content. Other factors such as sulphates and citric acid can influence the taste of wine in a positive way. Factors such as volatile acidity can negatively influence the taste of wine.

We found other factors that can ruin the taste of wine if not handled properly, including chlorides, density, and total sulfur dioxide. Managing those negative influences on wine quality would help winemakers produce better quality wines, but would not necessarily make an award-winning wine. A sample of 1,599 red wines is a start for examining the relationship between wine variables and the quality of wine, but further testing and data examination would be necessary before the findings in this study could be considered conclusive.